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Abstract 

We provide a new approach to approximate emulation of large computer experi- 
ments. By focusing expressly on desirable properties of the predictive equations, we 
derive a family of local sequential design schemes that dynamically define the support 
of a Gaussian process predictor based on a local subset of the data. We further derive 
expressions for fast sequential updating of all needed quantities as the local designs 
are built-up iteratively. Then we show how independent application of our local design 
strategy across the elements of a vast predictive grid facilitates a trivially parallel im- 
plementation. The end result is a global predictor able to take advantage of modern 
multicore architectures, while at the same time allowing for a nonstationary modeling 
feature as a bonus. We demonstrate our method on two examples utilizing designs 
sized in the thousands, and tens of thousands of data points. Comparisons are made 
to the method of compactly supported covariances. 

Key vi^ords: sequential design, sequential updating, active learning, surrogate model, 
emulator, compactly supported covariance, local kriging neighborhoods 



1 Introduction 



The Gaussian process (GP) is a popular choice for emulating computer experiments (Santner 



et al. , 2003). As priors for nonparametric regression, they are unparalleled; they are rarely 
beat in out-of-sample predictive tests, have appropriate coverage, and have an attractive 
ability to accomplish these feats while interpolating the response if so desired. That said, 
GPs do have disadvantages, and our aim is not to make an exhaustive or even representative 
list of them here. But perhaps the two most common criticisms are that (1) computation 
(for inference and prediction) scales poorly as data sets get large, and (2) that best results 
require an assumption of stationarity in the data-generating mechanism, which may not be 
appropriate. The second issue is usually coupled with the first. Emulators relaxing the 
stationarity assumption exist, but often require further computational expense. Examples 
include learning mappings from the original input space to one where stationarity reigns (e.g. 
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Schmidt and O'Hagan, 2003, and references therein), and process convolution approaches 



that allow the kernels to vary smoothly in parameterization as an unknown function of their 
spatial location (e.g., Paciorek and Schervish, 2006, and therein). 



Several works in recent literature have made inroads in reducing the computational bur- 
den for stationary models, usually by making some kind of approximation. Examples include 
searching for a reduced set of "pseudo- inputs" (Snelson and Ghahramani, 2006), building 
up estimators iteratively in batches (e.g., Haaland and Qian, 2011), sequentially updating 



via particle methods tailored to design applications (Gramacy and Poison, 2011), fixed rank 



kriging (Cressie and Johannesson, 2008), and the use of compactly supported covariance 



(CSC) matrices and fast sparse linear algebra packages (Kaufman et al. , 2012 Sang and 



Huang 2012). Our contribution has aspects in common with all of these approaches and 
several others by association (e.g., Nychka et al. , 2002 Quifionero-Candela and Rasmussen 



2005 Furrer et al. , 2006). It is reminiscent of ad hoc methods based on local kriging neigh- 
borhoods (e.g., Cressie, 1991, pp. 131-134). What we propose is more modern, scalable, 
more easily automated, which makes it ideal for multicore computing, and tailored to com- 
puter experiments. It draws in part on recent findings for approximate likelihoods in spatial 
data (e.g.. Stein et al. , 2004), and active learning techniques (e.g., Cohn, 1996). 



We start by focusing on the prediction problem, locally, at an input x. In computer model 
emulation this is the primary problem of interest. We recognize, as many authors have before 
us, that data points far from x have vanishingly small influence on the predictive distribution 
(assuming the usual choices of covariance function). Thus, even though a large covariance 
matrix may have been calculated and inverted at great computational expense, not many 
of its rows/columns contribute substantively to the linear predictor. We focus on finding 
those data points (relative to x), and corresponding rows/cols, without considering the full 
matrices. In particular, we illustrate how a sensible objective criterion reduces to a key 
component of an active learning heuristic that has found recent application in the design of 
computer experiments literature (see, e.g.. 



Gramacy and Lee 2009). 



Our localized sub-designs differ from a local nearest neighbor (NN) approach, and yield 
more accurate predictions. This echoes a result noted by Stein et al. ( |2004| ) who observe that 
NNs may not be ideal for learning correlation parameters. We extend that to obtaining pre- 
dictions. Our predictions are robust, in a certain sense, to the covariance parameterization. 
First, they are applicable to any covariance so long as derivatives can be calculated. Second, 
we deliberately choose a very naive one and nonetheless compare favorably out-of-sample to 
others having more thoughtful choices (e.g., Kaufman et al. , 2012). The explanation is our 



local inferences imply a global nonstationary effect that compensates for overly smooth and 
simplistic local structure. 

There are strong parallels between our suggested approach and CSC, although we are not 
applying any kind of tapering. We think of tapering as a continuous or smooth operation 
whereas our approach is more discrete. An advantage of our approximation is that accuracy 
can be explicitly linked to a computational budget — we know exactly how "sparse" our 
matrices are going to be before running the code, and we can monitor convergence to the 
(intractable) full-data counterparts. Importantly, no special sparse linear algebra routines 
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are need; we use only standard dense-matrix ones. Finally, since we focus on particular 
locations x, independent of others, our method is highly parallelizable. Parallel computation 
at a large set of x locations can proceed without communication between nodes, which means 
that trivial application of OpenMP pragmas yields nearly-linear speedups. 

The remainder of the paper is outlined as follows. In Section [2] we review GP predictive 
modeling, inference and prediction. In Section |3] we derive a new sequential design criteria for 
greedily building local designs by considering future mean-squared prediction error (MSPE). 
We also derive expressions for the efficient updating of all required quantities in order to 
keep computational costs down, and argue that a simpler heuristic related to one from 
the active learning literature may prove to be even more efficient. Then in Section 4.2 we 



consider the global prediction problem, for many x, and discuss global inference for the 
correlation structure. Finally, all of the elements are brought together in Section |5] for 
empirical illustrations and comparisons to CSC. We conclude with a discussion in Section [6] 



2 Gaussian process predictive modeling 

A GP prior for functions F : — )■ M, where any finite collection of outputs are jointly 
Gaussian (Stein, 1999[ ), is defined by its mean jj,{x) = E{F(x)} and covariance C{x,x') 



E,{[Y {x) — iJ,{x)] [{Y {x') — fi{x')]^]} . A common simplifying assumption in the computer mod- 
eling literature is to take the mean to be zero, and this is the convention we will follow. Typ- 
ically, one separates out the variance in C{x,x') and works with correlations K{x,x') = 
T~'^C{x, x') based on Euclidean distance and a small number of unknown parameters. In this 
paper we focus on the isotropic Gaussian correlation K0^ri{x,x') = exp{ — \\x — x'W"^ / 9} + ri6x,x' 
where 6 is called the lengthscale parameter, and t] the nugget. 

In keeping with trends in the computer experiments literature, we hold at a pre- 
determined small value in order to obtain a near-interpolative surface which is appropriate 
when the computer model is deterministic. It is important to note that our main results 
do not heavily depend on of the choice of correlation function so long as derivatives (with 
respect to 6) are available. The isotropic, one-parameter [6) family simplifies the exposition. 
However the derivations of our main results, provided in Appendix [B| assume a separable 
family for generality. 

Although technically a prior over functions, the regression perspective recommends a 
likelihood interpretation. Using data D = {X,Y), where X is a x p design matrix, 
the N X 1 response vector Y has a multivariate normal (MVN) likelihood with mean zero 
and covariance = t'^K, where K is an N x N matrix with {ijY^ entry K{xi,Xj). 

Conditional on K, the MLE of is available analytically. Profile likelihoods may then be 
used to infer other unknown parameters (e.g., 6). 



Empirical Bayes (Berger et al. , 2001 ) is more popular these days. One specifies a reference 
prior for the variance component, 7r(r^) oc r~^, and integrates it out. An MVN likelihood 



3 



and prior which is algebraically equivalent to IG(0, 0) makes integration analytic. We obtain 
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where = R-^Y. 



Numerical methods for finding parameters like 6 generally work well when leveraging deriva- 
tive information for the log likelihood, quoted here for convenience in later developments; 
derivations are included in Appendix [B] 
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where k and k are shorthand for ^ and respectively. These can be used as the basis 
of a Newton-Raphson scheme, but complications arise in the not uncommon situation where 
the likelihood is multimodal (e.g., Rasmussen and Williams, 2006 Chapter 5). 

The marginalized predictive equations for GP regression are available in closed form. 
Specifically, the distribution of the response Y{x) conditioned on data D and covariance 
R{-, ■), i.e., p{y{x)\D, R), is Student-t with degrees of freedom N, 



mean 



and scale 



^{x\D,R) = k^{x)R-% 



a\x\D,R) 



ij[R{x,x) - k'^{x)R-^k{x)] 



N 



(4) 
(5) 



where k~^{x) is the A^- vector whose i"" component is R{x,Xi). Using properties of the 
Student-t, the variance of Y{x) is V{x) = Var[r(x)|D] = a^{x\D, R) x A^/(A^ - 2). 



3 Localized approximate emulation 

If obtaining fast approximate prediction at x is the goal, then a first stab may be to use a 
small sub-design of X locations closest to x. I.e., choose the n-nearest neighbor (NN) sub- 
design Xn{x) C X and apply the predictive equations (|4]-[5| with Dn{x) = (X^,!^) where 
n is chosen as large as computational constraints will allow. This is a sensible approach. 
It is clearly an approximation: as n N, predictions of Y{x)\Dn and Y{x)\D converge 
trivially. Moreover, since the predictive equations are valid for any design, interpreting 
the accuracy of the approximation involves the same quantities as those obtained from an 
analysis based the on full design. Clearly, when n <^ N, the variance V{x)\Dn can be much 
larger than V{x)\D. However, if A^ is so big as to preclude obtaining any predictor at all, 
within reasonable computational constraints, obtaining a predictor with marginally infiated 
uncertainty is better than nothing. 
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Though simple, one naturally wonders if it is optimal given computational constraints 



n). It is not: Vecchia (1988) observed that the best sub-design for prediction depends on the 



correlation parameters {6) and is usually not the NN design. Stein et al. (2004) go further by 
showing that NN is uniformly suboptimal under certain regularity conditions. Still, finding 
the best design for n > 1 is computationally difficult. It would involve a high-dimensional 
non-convex optimization — a computational nightmare in its own right. Joint inference with 
6 is more difficult still, compounding the issue. Although the simple yet sub-optimal NN 
idea seems attractive again, the main goal of this paper is to demonstrate that it is possible 
to do better without much extra effort. 

The general framework, which is akin to forward stepwise variable selection in regression, 
is as follows. For a particular predictive location x, we search for a sub-design X„(x), 
with accompanying responses Yn{x), together comprising data Dn{x), by making n greedy 
decisions Dj^i{x) = Dj{x) U {xj+i,y{xj+i)) for j = 1,. . . ,n. Each choice of Xj+i depends 
on the previous choices saved in Dj{x) by searching over a decision criterion. Evaluating 
the criterion, and updating the quantities required for the next iteration, must not exceed 
O(j^) so that the total scheme remains O(n^). We start with a very small Dq{x) comprised 



of NNs, which is easy to motivate in light of results shown in Section 3.4 



3.1 Greedy search to minimize mean-squared predictive error 

Given x and Dj{x), we search for Xj^i by considering its impact on the predictive variance 
of Y{x), taking into account uncertainty in 6, through an empirical Bayes mean-square 
prediction error (MSPE) criterion: J{xj+i,x) = K{[Y{x) — 6'j+i)]^|Dj(a;)}, which 
should be minimized. The predictive mean Hj+i{x;9j+i) follows (j4|, although we embellish 
notation here to explicitly indicate dependence on xj+i and the future /unknown yj+i- This 
mean represents the hypothetical prediction of Y{x) arising after choosing Xj+i, observing 
Uj+i, and calculating the MLE 6'j+i based on the resulting Dj+i{x). Where convenient we 
drop the x argument in the local design, Dj = Dj{x). Also note that 6j = 6j{x) since it 
depends on Dj{x). 

In Appendix |B] we derive the approximation 

J{x,^„x) ^ V,{x\x,^r-e,) + (6) 

The terms in (|6| are explained below in turn. 

• Vj{x\xj^i] 6) is an estimate of the new variance that will result after adding x^+i into 
Dj, treating 9 as known. 

Vj{x\xj+i\e) = ^j^^'^^ Vj+i{x;e), 
where Vj+i{x;9) = [Kj+i{x,x) - kJ^^{x)Kr_l-^kj+i{x)] (7) 
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is capturing Xj+i's contribution to the reduction in variance by placing its correlations 
to Xj in the j + 1*^* row/col of Kj^i. The correlation between x and Xj+i(2;) = 
Xj+i) comes in through the j + 1^* entry of /cj+i, the rest being kj. The other 
part of the expression, involving an estimate of via i/jj/j from Dj, deliberately avoids 
using i/j+i but it recognizes its effect as an additional design point in the denominator. 

^^gg'^** is the partial derivative of the predictive mean (4) at x, given Dj, with respect 
to the lengthscale parameter. In the appendix this is derived as 

^^^^ = Kr%ix) - k,Kr%ix)VY,, (8) 

where kj{x) is the j-length column vector of derivatives of the correlation function 
K{x,Xk), k = 1, . . . , j , with respect to 6. 

Qjj^i{6) is the expected Fisher informatioiij^ comprised of the information from Dj plus 
the expected component from the future yj+i at Xj+i. 

Q^^m = F,{e) + E {-^-^^^^l^\Dr, (9) 

_ p/m I 1 f dVj{xj+i;e) y 1 / dfij{xj+ ,;ey ' 

2V,{x,+^,9y \ de ) ^ V,(x,+i;0) V 06 

where Fj{e) = -i"iYj;9) from Q, Vj{xj+i;9) = V{xj+i\Dj;0) following Q, and 
dV,ix;e) _Yj'Kr%Y, 



86 j-2 



K{x,x) — k- {x)kj{x) 



kj{x)kj{x) + kj{x) {kj{x) — lCjkj{x)) 



where /Cj = KjKj ^ and kj{x) = Kj ^kj{x). 

The approximation in ([6]) is three-fold: (a) the partial derivative of jJij+i is approximated 
by that of /i^; (b) the Student-t predictive equations are approximated by moment-matched 
Gaussian ones (fe); and (c) the future 6'j+i is replaced by the current estimate 6j based 
on the data Dj obtained so far. All are required since averaging over future expected Vj+i 
is analytically intractable. Even though a Uj+i is technically available in our context, we 
deliberately do not condition on this value, which would result in the undesirable avoidance 
of selecting whose Uj+i is in disagreement with the previously selected sub-design. 

[See Appendix |B] for further discussion.] 

Observe that when the Fisher information is large, the MSPE d6|) is dominated by the 



reduced variance for Xj+i, the first term. We return to this special case in Section 3.3 For 



^The information is a scalar for scalar 6 in this development, however the derivations in the appendix 
apply more generally. 
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smaller j the extra term gives preference to choosing an Xj+i that is expected to improve 
estimates of 6 nearby to x. Considering that a sensible non-local design strategy is to 
maximize (the determinant of) the Fisher information, the MSPE is thus making a local 
adjustment by (a) weighting the Fisher information by the rate of change of the predictive 
mean near x and (b) balancing that with the aim of reducing local predictive variance. The 
appropriate balance is automatically dictated by the definition of J{xj^i, x), which considers 
the effect of uncertainty in 6 on the predictive uncertainty in Y{x). Aspect (a) is suggestive 
of potential benefit in estimating a localized, i.e., nonstationary, spatial field. 



3.2 Fast updates 

MSPE is deployable because everything it requires can be obtained by fast updating schemes 
from one iteration to the next: j — )■ j + 1. Below, we outline updates that require time in 
O(j^) so that n applications are in 0{n^). This is possible since K~_^-^ can be obtained 
efficiently from K~ ^, both conditional on the same parameter 9, via the partition inverse 
equations (Barnett 1979 Appendix |A|) . We use them in several applications. For example. 



Vj{x;9)-vj+^ix;e) (10) 
= kj {x)Gj{xj+i)mJ^{xj^i)kj{x) + 2kJ {x)gj{x')K{xj+i, x) + K{xj+i, x)'^mj{xj+i), 

where Gj{x') = gj{x')gj {x'), 

gj{x') = -mj{x')K-^kj{x') and mj\x') = Kj{x',x') - kj {x')Kr^kj{x'). (11) 

Each product above requires most an O(j^) time. Adding a row and column kj{xj+i) to Kj 
which, together with K{xj+i,Xj+i) = 1+7], yields i^j+i in 0{j) extra time and space. In 
addition to aiding in computation, Eq. (10) helps clarify how variance is reduced at x as a 
function of Xj+i. The i^(xj+i, x)^ term is suggestive of a large reduction the closer x is to 
Xj+i, however kj{x) suggests that distances from x to Xj{x) are also a factor. 

The log marginal likelihood requires logli^^l which, by similar considerations, may be 
updated in 0{j) time as follows: 

\og\Kj+i\ = \og\Kj\ + K{xj+i,Xj+i) + gj {xj+i)kj{xj+i)/mj{xj+i). 

Fast updates are also available for key aspects of the predictive equations (|4]-[5]): 

V hj{xj+i) +yj+imj{xj+i) J 

and ipj+i = tpj + hj{xj+if /mj{xj+i) + 2yj+ihj{xj+i) + y'^^^mj{xj+i), 

where hj{xj^i) = gj{xj+i) Both updates take time in 0{j). 

For fast updates of the observed information Fj^i{6), write the log likelihood as a sum 
of components from Dj and from (xj+i, ?/j+i), with the latter denoted by ij{yj^i;9). This 
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suggests the simple recursion Fj^i{9) = Fj{6) 



In Appendix 



B 



we show that 
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For compactness we are suppressing arguments to the mean and variance functions, and 
shorthanding their derivatives. E.g., = Vj{xj^i;6) and jlj = ^ ^^^a''/^'^'* - Expressions for 
second derivatives of predictive quantities may be found in the appendix. Observe that the 
expectation (given Dj) of the above quantity augments Fj{9) to complete Qjj^iiO) in 



Updates of all other quantities required for the calculations in Section follow trivially. 
Since they require fixed 6', our greedy scheme is only efficient if MLEs are not recalculated 
after each sequential design decision, suggesting further approximation is warranted to obtain 
a thrifty MSPE-based local design. We find that working with a reasonable fixed value of 
9 throughout the iterations works well. At the end of the n local design and update steps 
we can find the MLE 6{x) by Newton-like methods with analytic derivatives (bfpl, which 
would incur O(n^) cost once. More details on local inference are deferred to Section 4.2 



3.3 A special case 



Recall that when the Fisher information is large, the MSPE ^ reduces to Vj{x\xj+i\9), 

is added into the design [Section |3.1| . A similar 

but 



which is the new variance at x when Xj+i 



statistic has been used in the sequential design of computer experiments before, Dut m 
a different context: choosing design locations at which to run new computer simulations 

Rather than focusing on a particular 



le.g. 



Seo et al. 2000; Gramacy and Lee, 2009). 



predictive location. 



consider averaging reductions in variance over the whole input space 
> j{xj+i] 6) = Vj{x] 6) — Vj{x\xj^i; 6) dx. Part of the integrand does not depend on 
Xj+i, and so can be ignored. The other part is proportional to Vj+i{x]9) in (10). In the 
special case where K{-, ■) yields the identity (Anagnostopoulos and Gramacy, |2012p , or when 



A* is a finite and discrete set, the integral is also analytic. Otherwise numerical methods are 
needed, which may add significant computational burden. 

Choosing Xj+i to maximize AVj{xj+i;6) is a sensible heuristic since it augments the 
design with an input-output pair that has the most potential to reduce variance globally. In 
repeated application, j = 1, . . . ,N, the resulting designs have been shown to approximate 
maximum information designs (Cohn, 1996). To credit its originator, subsequent authors 



have referred to this technique as active learning Cohn (ALC). However, most note that 
the extra computational burden required for numerical integration is not justified relative to 
alternative design schemes. 

A local design criteria based on the integrand alone — the simplest implementation of 
which involves maximizing (10) — may therefore be a sensible alternative to MSPE, not just 
further approximation. On the one hand, it leverages an information theoretic criteria. On 
the other it shortcuts a great deal of computational effort and memory involved in MSPE 
since derivatives are not needed. In what follows we overload the terminology a bit and refer 



to local design based on maximizing (10) as ALC. However, it is important to note that its 
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application in this context is novel. ALC in its original form has never been used to select a 
local sub-design and, since no integration is needed (numerical or otherwise) the computation 
is straightforward. In Appendix [C] we show that this heuristic can be derived by studying 
partial correlations akin to those used to derive sequential least squares estimators. 

3.4 An illustration 

Consider a surface first studied by |Gramacy and Lee (2011b), 

/(xi,X2) = —w{xi)w{x2), where 

w{x) = exp [—{x — 1)^) + exp (-0.8(a; + 1)^) - 0.05 sin (8(a; + 0.1)) 

generating a data set of x-y pairs on dense 201 x 201 (= 40401 point) regular grid in [—2, 2]. 
Figure [T] shows local designs for two input locations, x, under MSPE and ALC criteria. 
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Figure 1: A comparison between ALC and MSPE heuristics at two predictive locations The 
numbers plotted indicate the order in which the design sights were chosen. 

Both share the same initial design Dq comprising the closest six NNs to x. Although the 
two heuristics agree on some sights chosen, observe that they are not selected in the same 
order after iteration nine. The patterns diverge most for further out local design choices but 
the qualitative flavor of the pattern of designs is similar. In both cases, the bounding box 
[—2, 2]^ impacts the symmetry of the local design. 

The full 46 local design iterations take less than a tenth of a second on a modern iMac, 
although in repeated trials (to remove OS noise) we found that MSPE takes 2-3 times longer. 
Inverting a 407^^ x 407^^ matrix cannot, to our knowledge, be done on a modern desktop 
primarily due to memory swapping issues. For a point of reference, inverting a 4000 x 4000 
matrix took us about seven seconds using a multithreaded BLAS/Lapack. Memory issues 
aside we deduce that the alternative full-GP result would have taken days. 
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Accuracy and Actual Precision of Estimator 



Predicted GP Precision 




Figure 2: Comparing the NN predictor to the greedy selection approach in 1000 repeated 
repetitions an LHS of size 10001: lOK train/1 test. Both methods are initiahzed with an 
NN design \Dq\ = 6. The left pane shows the actual bias and precision and the right shows 
the predicted precision (via the square-root of the GP predictive variances). Each repetition 
is shown in light shades; the darker dashed lines are 90% quantiles, and the dark solid line is 
the average. The open-circles with whiskers at j = 52 are the the 90% quantile and average 
results from a size 1000 NN design. 



Figure |2] illustrates the accuracy of the approximate predictive distribution thus obtained, 
with comparison to the NN alternative. These results summarize the output of a Monte Carlo 
experiment repeated 1000 times, whose setup is similar to that above with the following 
exception. In each repetition we generated a Latin Hypercube sample (LHS; [McKay et aH 



1979) of size 10001, and treated the first lOK as training {X,Y) and the last one as our 
testing {x,y). We then obtained the predictive equations under NN and greedy/MSPEj^ 
local designs (to x) of size 50, where both share a starting NN design Dq of size six. Each 
lightly-shaded line in the plot represents one Monte Carlo iteration. The dark-dashed lines 
outline point-wise 90% intervals, and the dark-solid lines show the point-wise averages. The 
open circles and whiskers at 52 on the j-axis provide an approximate benchmark for the best 
possible results. It is based on a NN local design of size 10000 

The left pane shows the bias, defined as the predictive mean /i„(x) minus the true ?/- value. 
Notice how, relative to the greedy/MSPE method, the NN approach is slow to converge to 
the true value and is consistently biased low after the initial design. Both start biased-low, 
which we attribute to 0-mean reversion typical (for CPs) in areas of the input space sparsely 



^The results for ALC are very similar. 

•^This is the largest design we could manage with reasonable computing effort (when repeated 1000 times). 
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g=1e-3 



g=1e-6 




Figure 3: Condition numbers versus design size for NN and greedy methods for two different 
values of the nugget, rj. 



covered by the design. The right pane shows the predictive standard deviation over design 
iterations. Notice that the NN standard deviations are consistently lower than those from 
the greedy/MSPE method. So the NN method both biased-low and more confident than 
its comparator. Finally, observe that both methods obtain a predictive mean close to the 
truth at \Dn\ = 50, with lower variability for the greedy/MSPE method. Moreover, both 
adequately acknowledge uncertainty inherent in the approximation obtained using a much 
smaller local design compared to one obtained with orders of magnitude larger designs. 

A knock-on effect of the spacing of the sub-design points found by the greedy method 
is that the condition numbers (ratio of the largest to smallest eigenvalue)]^ of the resulting 
covariance matrices, Kj, are lower relative to those of the NN method as j becomes large. 
Figure |3] provides an illustration. The left plot corresponds to exactly the setup giving rise 
to the right panel Figure [T| using ALC; whereas the right plot uses a smaller nugget value, rj. 
Lower condition numbers indicate greater numerical stability and thus more reliable matrix 
decompositions. This is relevant for GP surrogate modeling since it means that smaller 
nuggets, and thus truer interpolations, can be achieved with the greedy method if so desired. 
Although some recent papers have argued that the quest for truer interpolation in surrogate 



modeling may not be efficient from a statistical perspective (Gramacy and Lee, 2011a Peng 



and Wu, 2012), seeking out the smallest possible nugget while retaining numerical stability 



remains an aspiration in the literature (Ranjan et al. , 2011). 



4 Global emulation on a dense design 

The simplest way to extend the analysis to cover a dense design of predictive locations 
X G A* is to serialize: loop over each x collecting approximate predictive equations, each 

^We used the kappa function in R with default arguments. 
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in 0{n^) time. For T = \X\ the total computational time is in 0{Tn^). Obtaining each 
of the full GP sets of predictive equations, by contrast, would require computational time 
in OiTN"^ + A^^), where the latter A^^ is attributable to obtaining -R'"^]^ One of the nice 
features of standard GP emulation is that once has been obtained the computations 
are fast 0{N'^) operations for each location x. However, as long as n ^ our approximate 
method is even faster despite having to rebuild and re-decompose Kj{xys for each x. 



4.1 Parallelization 

The approximation at x is built up sequentially, but completely independently of other 
predictive locations, even ones nearby. Since a high degree of local spatial correlation is a 
key modeling assumption this may seem like an inefficient use of computational resources, 
and indeed it would be in serial computation for each x. However, this independence allows 
trivial parallelization. Predicting at a dense design of lOK locations on an 8-core iMac, 



otherwise using the same 40K design setup as in Section 3.4 but with a fixed 9 = 0.1 for 
all X, took about 30 seconds to execute and required only token programmer effort: we 
augmented a single C "for" loop over x ^ X with an OpenMP "pragma": 

#ifdef _OPENMP 

#pragma omp parallel for private (i) 
#endif 

for(i=0; Knpred; i++) { ... 

That's actual code from our implementation — and the full extent of our modifications for 
parallelization — where npred = \X\. This led to a nearly linear speedup: runtimes for P 
processors scale roughly as 1/-P. 



Illustration continued 

Figure |4] offers visualizations of the behavior of our greedy, local, approximations applied 
globally in the input space. The left column contains surfaces for both input dimensions, 
whereas the right one shows a Id slice for X2 = 1. The mean surfaces [to'p row) exhibit 
essentially no visual discrepancy to the truth. There are, however, small discrepancies {mid- 
dle row) and the level of accuracy is not uniform across the input space. Pockets of higher 
(though still very small) inaccuracy are visible. Finally, despite the uniform gridded design, 
the predictive standard deviation {bottom row) is very non- uniform throughout the input 
space, appearing to smoothly vary albeit across a narrow range of values. This subtly non- 
stationary behavior is due to the localized estimation of via ijj{x). It is not due to 6 since 
that was fixed globally. Closer inspection reveals that areas of lower predictive variability 
{bottom row) correspond to greater inaccuracy {middle). Both results suggest that the lo- 
calized greedy approximation is struggling to cope with the assumed stationary covariance. 
The predictor (s) may benefit from a more deliberate localized approach to the estimation of 
spatial correlation. 

^If only the predictive mean is needed, and not the variance, then the time reduces to 0{TN + N^). 
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Figure 4: The left column shows 2d visuahzations of the posterior mean surface fi{x\Dn), 
difference between the mean and the truth /x(x|-D„) — y{x) and the predictive standard 
deviation ^yV{x\Dn)■ The right column shows the same for the slice X2 = 0.5. 
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4.2 Localized parameter inference and stagewise design 

Wishing to leverage fast sequential updating, retain independence in calculations for each 
predictive location x so they can be parallelized, and allow for local inference of the correla- 
tion structure for nonstationary modeling, we propose the following iterative scheme. 

1. Choose a sensible starting global lengthscale parameter 6x = Oo for all x. 

2. Calculate local designs Dn{x,9x) based on sequential application of the MSPE design 
heuristic (|6|, independently for each x. 

3. Also independently, calculate the MLE lengthscale 6n{x)\Dn{x,6x) thereby obtaining 
a globally nonstationary predictive surface. 

4. Set 6x = On{x) possibly after smoothing spatially over all x locations. 

5. Repeat steps 2-4 as desired. 

6. Obtain predictive equations at each x, independently, based on Dn{x) and possibly 
smoothed 6x- 

Usually only one repetition of steps 2-4 is required for joint convergence of local designs 
Dn{x) and parameter estimates 6n{x), for all locations x. The initial is key, however, 
since pathological values can slow convergence. We recommend choosing 6o midway between 
the average minimum squared distance and average mean squared distance between all pairs 
of design points. 



MLE calculations in Step 3 proceed via Newton-like methods as described in Section 3.2 
We take the previous value 6x for initialization which may be in the first stage or a possibly 
smoothed 6n{x) in later stages. Convergence is very fast in the latter case (1-4 iterations) 
as we illustrate below. The former can require about twice as many iterations depending on 
^0, however this can be reduced significantly if a nearby 6n{x') can be used for initialization 
instead. As a share of the computational cost of the entire local design scheme, the burden 
of finding local MLEs is dwarfed by the MSPE calculations when the global design (i.e., 
the candidate locations for the local designs) is large, iV 3> n. So reducing the number of 
Newton steps may not be a primary concern in practice. 

The smoothing suggestion in Step 4 stems from both pragmatic and modeling consider- 
ations. CP likelihood surfaces can be multimodal so the Newton method is not guaranteed 
to find a global mode. Smoothing can offer some protection against rapid mode-switching 
in local estimators. On the modeling end, it is sensible to posit a priori that the correlation 
parameters be constrained somewhat to vary slowly across the input space. As we illustrate 
below, a simple neighborhood scheme leads to nice (smooth) visualizations of spatially vary- 
ing lengthscale parameters. We note that this step is not absolutely necessary to obtain 
good prediction results, though it may aid in convergence of the two-stage scheme. 
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Illustration continued 



Figure 3 shows the estimated log local lengthscale parameters on our illustrative example 
from Section [3] using ALC; the MSPE results are similar but require 50% more computational 
effort. The first pane shows a snapshot after the first stage; the second after the second stage. 
These have been smoothed using loess in R with a span of 0.01, giving non-negligible weight 
to only the twelve or so NNs in our particular predictive grid of x values. Notice the increase 



stage 1 smoothed theta-hat stage 2 smoothed theta-hat 




-1 1 -1 1 0.2 0.4 0.6 0.8 1.0 

xl xl theta-hat 



Figure 5: Smoothed log^„,(x) values after the first and second stages (reds are lower values), 
followed by a histogram of the of those values from the second stage. 

in fidelity from one stage to the next. The variation the correlation behavior over the 
input space is unmistakable after either stage. Observe the similarity to the predictive error 
and variance results from Figure [4], with larger lengthscales corresponding to areas poorer 
emulation under the model with a fixed, global, lengthscale. The final pane in Figure |5] 
shows a histogram of the locally estimated lengthscale parameters obtained after the second 
stage. The Newton steps from the first stage required about 6.5 iterations on average, with 
90% between 6 and 8 whereas the second stage took only 4.2 (3 to 6) when primed with the 
previous stage values. 



5 Empirical demonstration and comparison 



For more compelling illustration and comparison, consider the borehole function (Worley 
1987) which is a common benchmark for computer experiments (e.g. Morris et al. , 1993). 



The response y is given by 



Hi 



log 



1 + 



2LTu 



+ 



(12) 
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The eight inputs are constrained to he in a rectangular domain: 



G [0.05, 0.15] r G [100, 5000] T„ G [63070, 115600] Ti G [63.1, 116] 
Hu G [990, 1110] Hi G [700, 820] L G [1120, 1680] G [9855, 12045]. 

Part of our reason for choosing this data is that it was used as an iUustrative example for the 



CSC method described by Kaufman et al. (2012), which may be the most widely adopted 



modern approach to fast approximate emulation for computer experiments. The authors 
provide a script illustrating their R package sparseEM on this data. 



http : //www. Stat .berkeley. edu/~cgk/rcode/assets/SparseEmExample .R] 



In the first part of our comparison below we follow that script verbatim, except augmenting 
their 99% sparse estimator with a 99.9% one for faster results. 

They generate a LHS of size 4500 and use the first 4000 for training and the rest for 
testing. Nash-Sutcliffe efficiency is used to evaluate the predictive accuracy of their method. 

NSE = 1 - ~!f ''^"'^ / ; ^ ^- (13) 



Here, Y{x) represents the predicted value of Y{x), which for all models under comparison 
will be the posterior predictive mean; Y is the average value of Y{x) over the design space. 



The second term in (13) is estimated predictive mean-squared error (i.e., that which we 
approximately optimize over when calculating a local MSPE design) over the unstandardized 
variance of Y{x). Therefore NSE can be interpreted as an out-of-sample analog of R^. 

We compare several variations of our method, comprised of local designs via NN, and 
greedy ones following MPSE and ALC criterion. All methods are initialized with 6o = 0.7 
which is approximately the midpoint between the average minimum squared distance and 
average squared distance between all pairs of design points. We also consider a second 
iteration of MSPE and ALC design initialized with MLEs 6{x) (without smoothing) from 
the first stage. We consider NN with 9{x) too, ex post, but note that second stage local NN 
designs would be identical to those from the first stage since NN do not depend on 6. Other 
particulars are exactly the same as in our earlier illustrative example: the local methods 
start with |Do(3^)| = 6 NNs, and grow iteratively until |D„(x)| = 50. Finally, we repeat 
each experiment thirty times, in a Monte Carlo fashion, with new random LHS training and 
testing sets. 

Table [l| summarizes the results of two comparisons. One (left) for the original borehole 
comparison, and another (right) with double-sized training and testing sets. The timings, 
obtained on the same 8-core iMac throughout, comprise of the sum of fitting and prediction 
stages. OpenMP pragmas were used to parallelize our local design methods. Although CSC 
is not explicitly multi-threaded, the sparse linear algebra libraries it uses (from the spam 
package) are threaded, and therefore do utilize multiple cores g We can see clearly from 

®We observed between 1 and 5 threads being used by sparseEM, however only about 125% of the total 
800% of available computing resources were being used. 
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N = 4000, TVpred 


= 500 




N = 8000, A^pred 


= 1000 


method 


sees 


NSE 


p-val 


method 


sees 


NSE 


p-val 


mspe2 


+58.34 


0.99957 


0.41321 


alc2 


+75.02 


0.99965 


0.01324 


alc2 


+37.58 


0.99957 


0.00000 


mspc2 


+ 119.66 


0.99964 


0.00000 


mspe 


58.49 


0.99913 


0.02730 


mspe 


117.56 


0.99929 


0.11301 


ale 


37.59 


0.99911 


0.03470 


ale 


75.26 


0.99928 


0.00008 


csc99 


6961.17 


0.99906 


0.00000 


csc99 


27161.04 


0.99920 


0.00000 


esc999 


364.71 


0.99887 


0.00000 


CSC999 


2805.15 


0.99890 


0.00000 


nn 


0.91 


0.99581 


0.00486 


mspe.nomle 


115.77 


0.99705 


0.00000 


mspe.nomle 


58.02 


0.99551 


0.13011 


ale.nomle 


73.40 


0.99692 


0.00000 


ale.nomle 


36.85 


0.99546 


0.00000 


nn 


2.15 


0.99629 


0.00000 


nn.nomle 


0.20 


0.97337 




nn.nomle 


0.70 


0.98007 





Table 1: Average timings (in seconds) and accuracy (in NSE) values from two thirty- fold 
Monte Carlo experiments on the borehole data. The left table echoes the experiments for 
CSC; the right table is for a double-sized experiment. The third cohimn in both tables 
contains p- values from one-sided t-tests of adjacent performers (better v. next-best). The 
"+" next to the second-stage timings indicate that number reported is just for that stage. 
The total time is the sum of the first and second stages. 

the table(s) that even without parallehzation (8x slower) our local methods are orders of 
magnitude faster than the CSC ones. 

Differences in the raw accuracy numbers, via average NSE, are less stark at first glance. 
The rows in the table are ordered from most to least accurate. The third column gives a 
p-value for a one-sided paired t-test under the alternative that that row's NSEs came from a 
different population than those with the next- lowest averages (next row down). For example, 
at the 5% level, the best method ("mspe2") is not statistically better than the second best 
("alc2"), and whereas "alc2" is convincingly better than the third best ("mspe"), and so on. 
At a high level, observe that the local methods without 6{x) fare worst. NN is worse than 
the greedy methods across the board. Due to lack of statistical significance in differences 
between MSPE and ALC, the former might not be worth the added computational expense. 

Focusing on the lengthscale parameter, the greedy methods calculating 9{x) found ones 
around 7.5 whereas NN found ones around 5.5. The standard derivations were about 1.8 and 
1.4 respectively. Clearly NN, while extremely fast to compute, is incurring some bias towards 
more wiggly/less smooth predictive surfaces by not choosing any local design locations far 
away from x. The 9{x) values obtained for MSPE and ALC were very similar, although it 
is notable that second stage values had a slightly wider spread of ^(a;)s. 

The double-sized experiment in the right table tells a similar story. Accuracy results 
increase slightly, but relative orders are similar. What is noteworthy is that the local methods 
require, as expected, about double the computational effort. By contrast the CSC methods 
take four to eight times longer, suggesting a quadratic scaling in N. Anecdotally we remark 
that the 99% sparse CSC method on (N, n) — (8000, 1000) approaches the hmit of the size of 
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problem possible on this machine due to memory constraints. We tried N — 10000. Frequent 

memory pages to disk lead to dramatic slowdowns; our estimates suggest the code would 
have taken days to finish. By contrast, the local/greedy methods required seconds more. 

6 Discussion 

This paper outlines a family of sequential design schemes to modernize an old idea of ap- 
proximate kriging based on local neighborhoods. The result is a predictor that is fast, 
nonstationary, highly parallelizable, and whose "knobs" offer direct control on speed versus 
accuracy. Really the only tuning parameter is n, the size of the local design, which should be 
set as high as computational time/costs allow. By saving a small amount of information — 
indices of the local designs at each x, and corresponding 9{x) values — designs can augmented, 
picking up where they left off if more computational resources become available. Those re- 
sources do not need to be allocated uniformly. Larger local designs can be sought where, 
e.g., the predictive variance is larger. They need not apply the same greedy search method 
throughout. One could start with NN, continue with MSPE, and finish with ALC, which 
would be a sensible progression in light of earlier discussions. 

Having a stopping criterion might be desirable for situations in which computational 
limits are less constrained. One option is to monitor the ratio of the predictive and re- 
duced variances {Vj{x) /Vj^i{x)) over the iterations, and stop beyond a certain minimum 
threshold. Both depend on ipj, so cancellation would lead to a metric independent of the 
responses (other than via a common estimate of 9). However, non- uniform full-data 

designs coupled with spatially varying choices for 9 make the actual time to convergence — 
which is related cubically to nthrcsh(a;), the number of iterations required to reach a par- 
ticular threshold — highly unpredictable. That lack of predictability is compounded over 
thousands of independent local design searches. Eschewing the known (and easily con- 
trolled/augmented) computational costs of a global n, in preference for thousands unknown 
?^thresh(3:^), Hiay uot bc ideal from a practical perspective. 

One can introduce other degrees of freedom to obtain a faster, or higher-fidelity, approx- 
imation. For example, each greedy step searches over a potentially large set, X — Xj{x), 
involving MSPE/ ALC calculations at unlikely Xj^i very far x. Although such local designs 
do not coincide with NN ones, they still comprise primarily of nearby points. One could 
reduce the number of evaluations by searching in order of distance to x, stopping after some 
tolerance is reached on changes to ALC/MSPE. Even simpler is to consider a NN area of 
search candidates that is an order of magnitude larger than n, say n"^ for m-dimensional 
input spaces. We used an simpler default for results in this paper, considering the nearest 
1000 candidates. Results (except time) are unchanged with the nearest 10000. 

We remark that there is a trade-off between the size of local designs and the non- 
stationary flexibility of the global predictor: as the designs get larger, the global surface can 
exhibit less spatial heterogeneity. The approximation will improve relative to the full-data 
GP, but it may exhibit poorer out-of-sample performance, suggesting it could be beneficial to 
stop early. Post-process pruning steps, perhaps via an out-of-sample validation scheme, could 
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be deployed to find n*{x) < n. Another option is to augment the GP with elaborate mean 
structure, following (Kaufman et al. , 2012). That would allow a sparser covariance structure, 



which for us is a sparser local design. That mean structure is one of the big strengths of 
the CSC method, which we anticipate would fare better in extrapolation exercises, or when 
predicting in an area of of the input space with very sparse design coverage. 

When the design and predictive grids are dense, and when the same 6 parameters are 
used globally, the pattern of local designs in the interior of the input space (away from 
the boundaries) are highly regular. Those patterns could be replicated over x, forgoing 
expensive MSPE/ALC calculations. However, regular designs are rare in practice, especially 
for computer experiments where space-filling designs are preferred. For dense grids, perhaps 



structured Gaussian process predictors are a better way forward (e.g. Gilboa et al. , 2012). 

Finally, we see great potential for parallelization aspects of our method. Symmetric- 
multi-processor (SMP) machines are commonplace in modern desktop computing. Numbers 
of cores are doubling every few years and our methods are well-positioned to take advantage. 
Graphical processing units (CPUs) are taking off for scientific computing, having thousands 
of stripped-down computing cores. Some authors have already jumped on that bandwagon 
in the context of computer experiments (e.g., Franey et al. 2012 Eidsvik et al. , 2011). 



Although requiring nontrivial extra programming effort (by comparison with OpenMP, say), 
we have high hopes for our methods in this platform. 
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A Partition inverse equations 



If K{x) 



K k{x) 
(x) K{x,x) 



then K-\x) 



[K ^ + g{x)g~^ {x)m ^(x)] g{x) 
(x) m{x) 



where g{x) = —m{x)K ^k{x) and m ^{x) = K{x,x) — k^{x)K ^k{x). 



B Mean-squared predictive error 

Here we derive the expressions behind the MSPE ^ development in Section |3} We allow 
any form of the correlation function where derivatives are analytic. Although our expres- 
sions in the main body of text assume a scalar parameter 6, the derivations here allow for a 
p-parameter family, where p > 1. Therefore our partial derivatives are expressed component- 
wise and vectorized as appropriate. In most cases, the single-parameter result is obtained 
straightforwardly by removing subscripts and collapsing products of identical second deriva- 
tives into squares and/or factors of two. 
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We begin with the batch log hkehhood derivatives which lead to the Fisher information. 
Define Wj = Kj^Yj, which is a component of the marginal (log) likelihood ([l]) for data Dj 
via ipj. Much of what follows repeatedly leverages that 



d9k 



-k: 



K7\ 



(14) 



where -r^ is the matrix of partial derivatives corresponding to Kj calculated via K0^^{x,x') 
for all (x, x') pairs in Dj. Recursive application yields £, . Then, for 1 < k,l < p, we have 



Also useful is 
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-K 
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Wj + 



dKj dwj dKj dwj 



(15) 



5 log \ Kj\ 
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d\Kj\ 


tr- 













Note that the trace of a product of square matrices M(i) and M^^) may be efficiently com- 
putedasE.^Mi^^M^?). 

Using the above results, the first and second derivatives of the log likelihood ij{0) = 
\ogp{Yj\9, Xj) given in ([T]) are (for 1 < k,l < p) 



and 



MM 



til K7^ 



2^. 



2^, 



' de, ' dOk de,de. 



^3 ^^3 



+ 



Y^R- 

3 3 



KJ^Y, I YjK 



'3 3 



The Fisher information Fj{6) matrix is obtained by negating the {k, I) elements of the second 
derivative above. 

Sequential updating of the Fisher information leverages a recursive expression of the log 
likelihood which follows trivially from a cascading conditional representation of the joint 
probability of the responses given the parameters ij^i{9) = logp(l^_|_i|6') = \ogp{Yj\9) + 
log p{yj+i\Yj, 6) = djiO) + ij{yj+i;9) where the final term represents the conditional log- 
likelihood for given Yj and 6. Taking (negative) second derivatives yields the following 
updating equations 
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Deriving the final term, here, requires expressions for yt+i\Yt, 9, i.e., the predictive equations, 
and their derivatives. First, 



ijiVt+i, 0) = const - - log Vj{xj+i; i 



J + 1 



log 



J -2 + 



Vj{xj+i;e) 



= const log Vj — 

2 



J + 1 



log 



J 



thereby defining some shorthand that will be useful going forward. 

The derivatives of the log likelihood follow immediately from applications of the chain 
rule. For 1 < k,l < p, 
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Eqs. (16-17) require the derivatives of the predictive equations. In turn, those require Wj 
and its derivatives UM, and a new quantity Zj = Kj^kj = K~^kj{xj+i] 9) and its derivatives: 



and 
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(18) 
(19) 



which follow from applications of (|14|). The derivatives of the mean are then revealed as 



d^ij 



dzj 



and 



d^z] 



(20) 



For the variances it helps write Vj = -jZ^^i^j^Vj where Vj = Vj{xj+i; 6) = JC^(xj+i, Xj+i) — 

kj {xj-^-l)K~^kj{xj+l), and -ipj = tpjid). We assume that when the arguments to K^^ri are iden- 
tical, as in Vj, the correlation function is no longer a function of 6. Derivatives then follow 
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as 
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(22) 
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This completes the expressions required for updating the Fisher information matrix. 

The MSPE calculation (g requires that the conditional likelihood be replaced with the 
expectation 

g,^,{9) = F,{9) + E^ 



09,09, 

because we wish to select the next Xj+i in a manner that does not utilize the "future" ob- 
servation Vj+i- Unfortunately, the Student-t predictive equations preclude a tractable 
analytic expectation calculation. Therefore, we approximate by employing Gaussian surro- 
gate equations with matched moments. I.e., 
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2 ^ 2 ^ ^ 2Vi 



(23) 



The (exact) derivatives for the approximation (23) then follow, for 1 < k,l, < p 
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(24) 
(25) 
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Taking the (negative) expectation of (25) gives 
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1 d% 
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Now, given a specified input site x at which predictions of Y{x) are desired, and given 
data Dj containing previously selected sites, the objective is to choose the next site Xj+i 
recognizing that the data pairs will lead to improved predictions via and an improved 

estimate of the correlation parameters 9. If the latter were of primary concern, one possible 
way to choose Xj+i would be to maximize the determinant of Qjj^i{9) for some choice of 
9. While reasonable, this does not directly consider the prediction error variance of Y{x). 
Therefore it possess no mechanism for giving a local character to the estimate of ^. So 
instead we propose to choose x^+i to minimize the predictive variance of Y{x) in a manner 
that considers the impact of estimation uncertainty in 9. Specifically, we choose Xj+i to 
minimize the Bayesian mean squared prediction error defined as 



J{xj+i)=E< Y{x) - ^j+i{x]9j+i) 



(27) 



where /ij+i(x; 6*^+1) depends on (xj+i, yj+i) and represents the hypothetical prediction of 
Y{x) that would result after choosing Xj+i, observing calculating MLE 9j+i based on 
the expanded data set -D^+i, and using the MLE in the prediction of Y{x). Eq. (27) becomes 
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Y, 



+ E\E<[Y{x) - iJ,j+i{x;9)] ij,j+i{x;9) - ijj+i{x;9j+i) Yj,yj+i,9 



Y, 



+ E |E 

--E\v,+i{x; 



fij+i{x; 9) - fij+i{x] 9j+i) Yj, yj+i, 9 



+ E 



IJ,j+i{x;9) - fij+i{x;9j+i) 



Y. 



Regarding the first term in (28), we approximate it as 



E\V,+,{x;9) 



E 



U - 1) 



■Vj+i{x;9) Yj 



(28) 



(29) 
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which can be recognized as the new/reduced variance (10) imphed after adding Xj+i into 
Dj but which does not condition on Uj+i and, therefore, uses the MLE calculated from Dj. 



Regarding the last term in (28), we again approximate as follows. 



de 

djij{x; 0) 



09 



(30) 



The last inequality in (30) involves two approximations. First, the partial derivative of 



jjL jj^i{x] 9) at future is approximated by the partial derivative at the current 9j. The 
second approximation uses the partial derivative of fij{x;9) instead. Both are required for 
tractability because 9j+i and fij^i{x,9) depend on the future i/j+i in a manner that would 
make the expectation in (30) difficult to evaluate analytically. 

Note that the difference between /ij+i(x;6') and fij{x,9), and their partial derivatives 
at a common value of 9, will generally be similar for larger j because the most influential 
points are included in initial iterations. For situations in which yt+i is available in advance, 
one might consider using it for calculating the partial derivative of fij+i{x]9), and thereby 
avoid an approximated expectation. However, even though this would lend tractability and 
accuracy in one sense, it may not be as attractive a criterion as using ^j{x,9). A strategy 
that used the partial derivative of 9) might avoid adding a site at which yj+i was not 

consistent with the other {yi,i = 1,2, .. . ,j}, which would ignore unusual or unpredictable 
features of the response surface. 



Substituting (29) and (30) into (28), and assuming the posterior covariance matrix of 9 



about its MLE can be approximated by the inverse of the Fisher information matrix Qj^i{9j] 
given in ([6]), leaves us with 



j+i) 



+ E 



dfij{x; 9) 
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dfij {x; 9) 



dfij{x; 9) 
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39 
dfij{x; 9) 
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C An interpretation of the reduction in variance cri- 
terion in terms of partial correlations 



The ALC criterion ( 10 ) has a helpful interpretation in terms of the partial correlation between 



the prediction errors at x and at some x', a potential new site to be added into the design. 
For the case when all parameters are known, denote the error in predicting Y{x), given 
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data Dj^ by e{x\Yj) = Y{x) — fij{x;6). Note that e{x\Yj) and Yj are orthogonal under the 
standard correlation inner product, and consider the orthogonal decomposition 



Y{x) = 6) + e{x\Y,) = fi^{x; 9) + p \ ; l ''' e{x'\Y^) + e(x|F„ x') (31) 

where 

^ ' 7Var[e(x|F,-)]VVar[e(x'|F,)] 

denotes the correlation coefficient between the errors in predicting Y{x) and y(x'). Some- 
times this is called the partial correlation coefficient between Y{x) and Y{x') given Yj. 

Because of the orthogonality of the three terms in (31), the error variance in predicting 
Y{x) given Yj is 

Na.i[e{x\Yj)] = p'^Yar[e{x\Yj)] + Yai[e{x\Yj,x')]. (32) 



The right-most term in ( p32| is defined as the error variance in predicting Y{x) after including 
the response at x'. Hence, the term 

p^Var[e(a:|r,)] = Corr^[e(x|F,), e{x'\Y,)]YaT[e{x\Y,)] = ^"""^ ^1^^^^^^^ ^^^^ 

is the reduction in predictive variance in Y{x) after including the response at x' . 
Noting that 

CoY[e{x\Yj),e{x'\Yj)] = t^[K{x,x') - kj {x)Kj^kj{x')], 



it is straightforward to show that (33) is equivalent to the criterion in (10) with ~ 2), 

an estimate of r^, replaced by r^. Because Var[e(x|Yj)] does not depend on x', the expression 
(33 ) for the reduction in variance implies that the next site Xj+i should be chosen to maximize 
the correlation between the errors in predicting Y{x) and Y{xj^i) given Yj. 
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